An Architecture for Linguistic and Semantic Analysis on the arXMLiv Corpus

Authors

  • Deyan Ginev
  • Constantin Jucovschi
  • Stefan Anca
  • Mihai Grigore
  • Catalin David
  • Michael Kohlhase
Abstract

The arXMLiv corpus is a remarkable collection of text containing scientific mathematical discourse. With more than half a million documents, it is an ambitious target for large-scale linguistic and semantic analysis, requiring a generalized and distributed approach. In this paper, we implement an architecture that addresses the issues of knowledge representation and knowledge management and automates their handling, providing an abstraction layer for the distributed development of semantic analysis tools. Furthermore, we enable document interaction and visualization, and we present current implementations of semantic tools and follow-up applications that use this architecture.

We identify five different stages, or purposes, that such an architecture needs to address, and we encapsulate each in an independent module. These stages are determined by the different properties of the document formats used, as well as by the state of processing and the linguistic enrichment introduced so far. We discuss the need for migration between XML representations and the challenges it would pose for our system, revealing the benefits and trade-offs of each format we employ. At the heart of the architecture lies the Semantic Blackboard module, a system based on a centralized RDF database that facilitates distributed corpus analysis by arbitrary applications, or analysis modules. It does so by providing a document abstraction layer and a mechanism for storing, reusing, and communicating results via RDF stand-off annotations deposited in the central database. The Preprocessing, Semantic Result, and Output Generation modules are responsible for a properly encapsulated and automated pipeline from the input corpus document to a semantically enriched output in a state-of-the-art representation. Each of them addresses the task of format migration and enhances the document for further semantic enrichment or aggregation. The fifth module, targeting Visualization and Feedback, enables user interaction and the display of the different stages of processing. The overall purpose of the architecture is to facilitate the development and execution of semantic analysis tools for the arXMLiv corpus, automating the migration of knowledge representation and establishing a complete pipeline to a document representation that is both presentation- and content-enriched.

Additionally, we present three applications based on this architecture. Mathematical Formula Disambiguation (MFD) is an analysis module that uses heuristic pattern matching to disambiguate symbol and structure semantics. Context Based Formula Understanding (CBFU) is another Semantic Blackboard module, which in turn focuses on establishing context relationships between symbols, helping to disambiguate their semantics. We also present the Applicable Theorem Search (ATS) system, a follow-up application that performs search functions, retrieving theorem preconditions for the user.

Similar resources

Analysis of The Relationship Between Theoretical Aesthetic Ideas And Modern- Postmodern Architectural Styles; (A Comparative Study Of Modern And Postmodern Architecture)

Physical attributes have always been a qualitative indicator for evaluating an architectural work. These attributes have been influenced by function, technology, and changes in the process of creation and the perception of beauty in modern times, and by content, culture, history, meaning, and symbolic linguistic structures in the postmodern era. In accordance with the evolution of aesthetic theories si...

Semantic Role Labeling of Persian Sentences Using a Memory-Based Learning Approach [برچسب‌زنی نقش معنایی جملات فارسی با رویکرد یادگیری مبتنی بر حافظه]

Extracting semantic roles is one of the major steps in representing text meaning. It refers to finding the semantic relations between a predicate and syntactic constituents in a sentence. In this paper we present a semantic role labeling system for Persian, using a memory-based learning model and standard features. Our proposed system implements a two-phase architecture to first identify...

English and Persian Sport Newspaper Headlines: A comparative study of linguistic means

Using rhetorical figures in specialized languages, such as the language of newspaper headlines, is common. The present study attempted to conduct a contrastive analysis of the English and Persian sport newspaper headlines related to the 2014 FIFA World Cup. Toward this end, a corpus consisting of 400 English and 400 Persian headlines published from 12 June to 13 July 2014 was c...

Examining the Effect of Increasing Gibson’s Semantic Levels on Non-profit School Students’ Environmental Satisfaction in Tehran’s District

The discussion of man and his alienation in contemporary environments has become an important challenge in environmental psychology studies. Gibson, as a theorist in the field of environmental psychology studies, introduces the component of meaning and divides it into six levels, the sixth of which is the ultimate connection between man and the environment. From the audience’s point of view, wha...

Journal title:

Volume, issue:

Pages: -

Publication date: 2009